feat(app): migrate app and ante telemetry to OpenTelemetry Meter API #3396
amir-deris wants to merge 18 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3396      +/-   ##
==========================================
+ Coverage   59.34%   59.37%   +0.02%
==========================================
  Files        2112     2114       +2
  Lines      174772   174919     +147
==========================================
+ Hits       103724   103859     +135
- Misses      62010    62019       +9
- Partials     9038     9041       +3
bdchatham
left a comment
Migration shape is right and dual-emit is sound. Three blockers worth fixing before merge, all silent (no runtime error, just wrong data):
- Counter and histogram names will double-suffix under the Prometheus exporter (`_total_total`, `_seconds_seconds`).
- `app_tx_count_total` double-counts every tx via the unlabeled + labeled `Add` pair.
- `txGasUsed`/`txGasWanted` as `Int64Gauge` is last-write-wins under OCC concurrency.
Plus should-fixes around bucket density at the p99=2.5s SLO threshold, the lazy `sync.OnceValue` init in `app/ante/`, and the raw proposer label encoding. Inline below.
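
To make the last-write-wins blocker concrete, here is a minimal stdlib-only sketch. `lastValueGauge` and `sumCount` are hypothetical stand-ins (not OTel SDK types and not code from this PR) that model how a synchronous gauge and a histogram-style aggregate behave when several transactions record gas from parallel goroutines.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// lastValueGauge models a synchronous gauge: each Record overwrites the
// previous value, so only one sample survives until the next collection.
type lastValueGauge struct{ v atomic.Int64 }

func (g *lastValueGauge) Record(x int64) { g.v.Store(x) }
func (g *lastValueGauge) Value() int64   { return g.v.Load() }

// sumCount models the count and sum a histogram keeps: every Record is
// folded into the aggregate, regardless of goroutine ordering.
type sumCount struct {
	count atomic.Int64
	sum   atomic.Int64
}

func (s *sumCount) Record(x int64) {
	s.count.Add(1)
	s.sum.Add(x)
}

// recordConcurrently records each gas value from its own goroutine,
// loosely imitating transactions executed in parallel.
func recordConcurrently(gas []int64) (*lastValueGauge, *sumCount) {
	g, h := &lastValueGauge{}, &sumCount{}
	var wg sync.WaitGroup
	for _, x := range gas {
		wg.Add(1)
		go func(x int64) {
			defer wg.Done()
			g.Record(x)
			h.Record(x)
		}(x)
	}
	wg.Wait()
	return g, h
}

func main() {
	g, h := recordConcurrently([]int64{21000, 50000, 120000, 300000})
	// The gauge holds whichever goroutine stored last; the other three
	// samples are gone. The aggregate still reflects all four.
	fmt.Println("gauge:", g.Value())
	fmt.Println("count:", h.count.Load(), "sum:", h.sum.Load())
}
```

Which value the gauge ends up with depends on scheduling, which is the point: between two scrapes the series reports one arbitrary transaction rather than a distribution.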
Dashboard rewrites land alongside the PLT-327 removal step; existing legacy refs in clusters/prod/monitoring/grafana-dashboards-protocol.yaml are summary-typed ({quantile="0.5"}), so query rewrites are histogram_quantile-based, not 1:1. No alert/recording rules touch these names — the dual-emit window is sufficient buffer.
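
For context on why the rewrites are not 1:1: a summary series carries a precomputed quantile, while a histogram only carries cumulative bucket counts, so the quantile has to be estimated at query time. The sketch below is a simplified Go model of the linear interpolation that PromQL's `histogram_quantile` performs; the `bucket` type and the sample counts are made up for illustration, and it assumes `0 < q ≤ 1` and a non-empty histogram.

```go
package main

import "fmt"

// bucket is one cumulative histogram bucket: Count is the number of
// observations less than or equal to UpperBound.
type bucket struct {
	UpperBound float64
	Count      float64
}

// histogramQuantile estimates the q-quantile from cumulative buckets sorted
// by UpperBound, where total is the count of the +Inf bucket. It is a
// simplified model with no NaN or edge-case handling.
func histogramQuantile(q float64, buckets []bucket, total float64) float64 {
	rank := q * total
	lower, below := 0.0, 0.0
	for _, b := range buckets {
		if b.Count >= rank {
			// Interpolate linearly inside the bucket that contains the rank.
			return lower + (b.UpperBound-lower)*(rank-below)/(b.Count-below)
		}
		lower, below = b.UpperBound, b.Count
	}
	// The rank falls in the +Inf bucket: report the highest finite bound.
	return lower
}

func main() {
	// Illustrative counts only, with boundaries at the SLO thresholds.
	buckets := []bucket{{0.5, 50}, {1.5, 95}, {2.5, 99}}
	for _, q := range []float64{0.5, 0.75, 0.999} {
		fmt.Printf("q=%.3f estimate=%.4fs\n", q, histogramQuantile(q, buckets, 100))
	}
}
```

Because the estimate is interpolated within a bucket, its error is bounded by the bucket width around the quantile of interest, which is why bucket density near the 2.5s threshold matters.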
// InitAnteMetrics registers all OTel instruments for the ante package.
// Safe to call concurrently; instruments are registered exactly once.
func InitAnteMetrics() {
	appAnteMetrics.mu.Lock()
var millisecondBuckets = metric.WithExplicitBucketBoundaries(
	0.000025, 0.000050, 0.0001, 0.0005, 0.001, 0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1,
)
nit: I'd add 0.25, 0.5, 1.0. `commitDuration` is on this set, and pebble compaction stalls during heavy write load can push commit past 100ms (verified by walking through `BaseApp.Commit` → `cms.Commit(true)`: the foreground write blocks during L0→L1 stalls). A 100ms ceiling lumps those commits into the overflow bucket exactly when you most want tail visibility.
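
A quick way to see the effect of the suggested boundaries is the stdlib-only sketch below. `bucketIndex` is a hypothetical helper, not an OTel API; it assumes upper-inclusive boundaries, as explicit-bucket histograms use. `currentBounds` is copied from the snippet above and `extendedBounds` adds the three suggested values.

```go
package main

import (
	"fmt"
	"sort"
)

// currentBounds are the boundaries from the snippet under review (seconds).
var currentBounds = []float64{0.000025, 0.000050, 0.0001, 0.0005, 0.001,
	0.0025, 0.005, 0.010, 0.020, 0.050, 0.075, 0.1}

// extendedBounds adds the three boundaries suggested in the comment.
var extendedBounds = append(append([]float64(nil), currentBounds...), 0.25, 0.5, 1.0)

// bucketIndex returns the index of the bucket a value falls into, given
// sorted, inclusive upper bounds. An index equal to len(bounds) means the
// value landed in the implicit +Inf overflow bucket.
func bucketIndex(bounds []float64, v float64) int {
	return sort.SearchFloat64s(bounds, v)
}

func main() {
	// Hypothetical commit durations during a compaction stall.
	for _, commit := range []float64{0.15, 0.4, 0.9} {
		fmt.Printf("%.2fs: current overflow=%t, extended overflow=%t\n",
			commit,
			bucketIndex(currentBounds, commit) == len(currentBounds),
			bucketIndex(extendedBounds, commit) == len(extendedBounds))
	}
}
```

With the current set, every commit slower than 100ms is indistinguishable from any other; with the extended set, the same samples resolve into separate finite buckets.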
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 7446303.

Summary
Migrates telemetry in the `app` and `app/ante` packages from the legacy `telemetry`/`utilmetrics` helpers to the standardized OpenTelemetry Meter API, following the same pattern established in #3265 for `evmrpc`.
- Adds `app/metrics.go` with a struct-based OTel instrument set registered via `otel.Meter("app")`, initialized once in `NewApp` after `SetupOtelMetricsProvider()`
- Adds `app/ante/metrics.go` with a lazily-initialized `anteMetrics` struct via `sync.OnceValue` (fires on first CheckTx, after the global MeterProvider is set)
- Pairs `telemetry.MeasureSince`/`utilmetrics` calls in `abci.go` and `invariance.go` with dual-emitted OTel instruments; legacy calls remain with `TODO(PLT-327)` markers until dashboards are migrated
- Threads `ctx`/`ctx.Context()` through all `.Record()` and `.Add()` call sites to support OTel's context-based propagation
- Replaces the per-block `GaugeSeidVersionAndCommit` call with an `app_build_info` observable gauge that fires on Prometheus scrape, with no per-block overhead
New metrics (OTel naming convention, exported via the process-wide MeterProvider)
ABCI phase durations, histograms bucketed at SLO thresholds (p50 ≤ 500ms, p95 ≤ 1.5s, p99 ≤ 2.5s):
- `app_abci_begin_block_duration_seconds`
- `app_abci_end_block_duration_seconds`
- `app_abci_module_end_block_duration_seconds`
- `app_abci_check_tx_duration_seconds`
- `app_abci_deliver_tx_duration_seconds`
- `app_abci_deliver_batch_tx_duration_seconds`
- `app_block_process_duration_seconds`: block tx processing duration by execution type
Transaction counters and gas:
- `app_tx_count_total`
- `app_tx_process_type_total`: by execution type label
- `app_tx_gas_total`: cumulative gas by type label
- `app_tx_gas_used` / `app_tx_gas_wanted`: per-tx gauges
App-level flow counters:
- `app_optimistic_processing_total`: cache hit (`enabled=true`) vs miss
- `app_failed_total_gas_wanted_check_total`
- `app_giga_fallback_to_v2_total`
Light invariance:
- `app_lightinvariance_supply_duration_seconds`
- `app_lightinvariance_supply_invalid_key_total`
- `app_lightinvariance_supply_unmarshal_failure_total`
Build info:
- `app_build_info`: observable gauge, always 1, labels: `seid_version`, `commit`
Ante:
- `app_pending_nonce_total`: pending nonce events by type (added, expired, rejected, accepted)
Migration note
Legacy metrics (`telemetry.MeasureSince`, `utilmetrics.MeasureDeliverTxDuration`, etc.) are dual-emitted during the migration window so existing dashboards continue to receive data. Each legacy call site is annotated `TODO(PLT-327)` and will be removed once dashboards are verified against the new OTel series.
Note
Medium Risk
Touches hot ABCI/tx execution paths to add new OpenTelemetry metric recording and thread contexts through invariance checks; functional behavior should be unchanged but any mistakes could impact performance or panic on misconfigured meter providers.
Overview
Adds new OpenTelemetry Meter-based instrumentation for the `app` and `app/ante` packages via `app/metrics.go` and `app/ante/metrics.go`, with one-time initialization in `New()`.
Updates ABCI handlers (`BeginBlock`, `EndBlock`, `CheckTx`, `DeliverTx`, `DeliverTxBatch`, `Commit`), block processing paths (sync/OCC/giga), and light invariance checks to record new OTel histograms/counters (durations, tx counts, gas totals, optimistic processing, gas-wanted rejections, giga fallbacks, invariance failures) while temporarily dual-emitting legacy `telemetry`/`utilmetrics` metrics behind `TODO(PLT-327)`.
Replaces per-block build info gauge emission with an `app_build_info` observable gauge and adjusts `LightInvarianceChecks` to accept a `context.Context` so OTel metrics can use the correct context.
Reviewed by Cursor Bugbot for commit f2384dd. Bugbot is set up for automated code reviews on this repo.